Skip to content

Conversation

@jfinkels
Copy link
Collaborator

@jfinkels jfinkels commented Jan 18, 2026

Fix the way form feed characters are interpreted by changing the way
lines and pages are found. Before this commit, a file comprising two
form feed characters (/f/f) would result in too few trailing newlines
at the end of the second page. After this change, each page is produced
with the correct number of lines.

This merge request changes the way files are read, replacing complex iterators
with a loop-based approach, iteratively scanning for newline or form
feed characters. The memchr library is used to efficiently scan for
these two characters. One downside of this implementation is that it
currently reads the entire input file into memory; this can be improved
in subsequent merge requests. The behavior of uutils pr is still quite different from GNU pr, so hopefully this is acceptable as a temporary debt to make many more straightforward improvements.

This merge request also removes a unit test that was enforcing incorrect behavior. These
tests were ostensibly designed to match corresponding ones in the GNU
test suite, but most of them don't match the current behavior of GNU pr, so it
does not seem valuable to keep them. As we improve uutils pr, we can add more targeted and minimal test cases to replace them.

@codspeed-hq
Copy link

codspeed-hq bot commented Jan 18, 2026

Merging this PR will degrade performance by 3.26%

⚡ 7 improved benchmarks
❌ 1 regressed benchmark
✅ 274 untouched benchmarks
⏩ 38 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Memory sort_key_field[500000] 51.8 MB 45.8 MB +12.95%
Memory sort_accented_data[500000] 28.3 MB 22.1 MB +27.85%
Memory sort_long_line[160000] 982.3 KB 724.4 KB +35.59%
Memory sort_ascii_only[500000] 28.3 MB 22.2 MB +27.78%
Memory sort_case_sensitive[500000] 17.4 MB 16.9 MB +3.03%
Memory sort_mixed_data[500000] 27.2 MB 22.5 MB +20.7%
Memory dd_copy_partial 129.1 KB 133.5 KB -3.26%
Memory cp_large_file[16] 119.5 KB 112.6 KB +6.13%

Comparing jfinkels:pr-simplify (aedca01) with main (0e50947)

Open in CodSpeed

Footnotes

  1. 38 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/shuf/shuf-reservoir (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/sort/sort-stale-thread-mem (passes in this run but fails in the 'main' branch)

1 similar comment
@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/shuf/shuf-reservoir (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/sort/sort-stale-thread-mem (passes in this run but fails in the 'main' branch)

Remove unit test for `pr` that was enforcing incorrect behavior. These
tests were ostensibly designed to match corresponding ones in the GNU
test suite, but they don't match the current behavior of GNU `pr`, so it
is not valuable to keep them.
Fix the way form feed characters are interpreted by changing the way
lines and pages are found. Before this commit, a file comprising two
form feed characters (`/f/f`) would result in too few trailing newlines
at the end of the second page. After this change, each page is produced
with the correct number of lines.

This commit changes the way files are read, replacing complex iterators
with a loop-based approach, iteratively scanning for newline or form
feed characters. The `memchr` library is used to efficiently scan for
these two characters. One downside of this implementation is that it
currently reads the entire input file into memory; this can be improved
in subsequent merge requests.
@jfinkels jfinkels changed the title pr: simplify the implementation of reading pages pr: uniformly scan for form feed and newline chars Jan 18, 2026
@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/tty/tty-eof (passes in this run but fails in the 'main' branch)

@jfinkels jfinkels marked this pull request as ready for review January 18, 2026 15:44
Comment on lines +768 to +776
// TODO Read incrementally.
let buf = if path == "-" {
let mut f = stdin();
let mut buf = vec![];
f.read_to_end(&mut buf)?;
buf
} else {
std::fs::read(path)?
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated code with get_file_line_groups
please fix it in a different pr

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

) -> Result<Self, FromUtf8Error> {
// TODO Don't read bytes to String just to directly write them
// out again anyway.
let line_content = String::from_utf8(buf.to_vec())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't work with buff directly ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to accommodate some existing code in this file that expects a String. It should be okay to just use the bytes, I will fix that in a new merge request. Thanks

@sylvestre sylvestre merged commit 1a81d1b into uutils:main Jan 18, 2026
156 of 157 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants